commonly used bill and invoice formats, such as handwritten 
notes and printed documents, for the purpose of automating 
database updates. The method developed in this study 
provides a quick and reliable means of extracting text from 
a bill. We combine deep learning with Optical Character 
Recognition (OCR) to analyse text in images and scanned 
documents and convert it into machine-readable text. We then 
use OpenCV to extract the text from the image and run it 
through our algorithm to maximise accuracy before 
automatically saving the extracted text to our database. 
Introduction 


Image processing, including the processing of invoices and handwritten bills, has become indispensable in 
all areas of business today. Entering this information manually into a specific format is too time-consuming 
to be practical [5]. The conventional approach of extracting text from photos addresses this, but it has also 
exposed a number of related issues [6]. Accurately transforming an image into text is difficult in practice 
[7]. Text recognition systems play a crucial role in the exchange of textual information between human and 
computer [8-12]. To extract text from photos, this study employs OpenCV and Nanonet. Recognizing text in 
natural photographs remains a difficult task [13]. This study provides an in-depth look at text recognition 
for extracting data from commonly used bill and invoice formats, such as handwritten notes and printed 
documents, in order to automate database updates [14-17]. The method developed in this study provides a 
quick and reliable means of extracting text from a bill. We discuss how to incorporate deep learning into 
text detection and extraction in three steps: first, optical character recognition converts text in an image 
or scanned document into machine-readable form; next, OpenCV extracts the text from the image; finally, our 
algorithm refines the result so that the text can be added to the database automatically [18]. Because no 
manual labour is needed, problems such as incorrect data entry can be avoided [19], which improves the 
accuracy of the information extracted from scanned documents. One of the most significant advantages of OCR 
data entry methods is that they allow organisations to cut down on the number of people needed to perform 
data extraction [20]. They also support high precision when retrieving search results [21-24]. 


In recent years, deep learning has garnered an incredible amount of attention. Convolutional neural networks 
(CNNs) are the most common and well-respected kind of deep network for images [25]. CNNs have found 
widespread use in fields such as object detection and facial recognition. Image classification with a CNN 
takes an input image, processes it, and assigns it to one of several predetermined categories (e.g., Dog, 
Cat, Tiger, Lion) [26]. A computer interprets a picture as a grid of pixels whose size depends on the image's 
resolution, and represents it as an array of h x w x d values (h = height, w = width, d = depth, that is, the 
number of channels). For example, a 6 x 6 x 3 array is an RGB image, where 3 is the number of colour 
channels, and a 4 x 4 x 1 array is a grayscale image. To classify an object with probabilistic values between 
0 and 1, deep learning CNN models train and test each input image by passing it through a sequence of 
convolution layers with filters (kernels), pooling layers, and fully connected (FC) layers, and then applying 
the Softmax function [27-34]. 
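The h x w x d representation and the final Softmax step can be illustrated with a minimal sketch. It uses plain Python lists in place of a real image library, and the class scores are hypothetical values chosen only for illustration; it is not the network used in this study.

```python
import math

def image_shape(image):
    """Return (h, w, d) for an image stored as nested lists [row][column][channel]."""
    return len(image), len(image[0]), len(image[0][0])

def softmax(scores):
    """Map raw class scores to probabilities in [0, 1] that sum to 1."""
    peak = max(scores)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# A 6 x 6 RGB image (d = 3) and a 4 x 4 grayscale image (d = 1), filled with zeros.
rgb = [[[0, 0, 0] for _ in range(6)] for _ in range(6)]
gray = [[[0] for _ in range(4)] for _ in range(4)]

# Hypothetical raw scores for the classes Dog, Cat, Tiger, Lion (assumed values).
probs = softmax([2.0, 1.0, 0.5, 0.1])
```

The class with the largest raw score receives the largest probability, and the four probabilities sum to 1.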


EAST, an Efficient and Accurate Scene Text detector, uses deep learning to identify text reliably. It is 
worth noting that EAST is a text detection method only: it locates text but does not recognise it [35]. It 
can therefore be combined with any text recognition technique [36]. To eliminate unnecessary intermediate 
steps, the text detection pipeline in the EAST paper consists of only two stages [37-42]. First, a fully 
convolutional network directly produces word-level or text-line-level predictions [43]. Second, the 
predictions, which are rotated rectangles or quadrangles, are passed through a non-maximum suppression stage 
to form the final output. EAST can detect text in both still images and video. According to the EAST paper, 
it processes 720p images at about 13 frames per second while maintaining a high level of accuracy [44]. The 
method is also implemented in OpenCV 3.4.2 and OpenCV 4, which is a significant practical advantage. We use 
this EAST model together with text recognition [45-49]. 


The EAST algorithm makes predictions at the word or line level using a single neural network. It detects 
text in arbitrary orientations by predicting quadrilateral shapes [50]. When it was published in 2017, this 
algorithm outperformed the best available approaches. The core of this approach is a fully convolutional 
network followed by a non-max suppression (NMS) merging stage [51]. The fully convolutional network localises 
text in the image, and the NMS stage merges the many overlapping text boxes it predicts into a single 
bounding box for each text region (word or text line) [52-55]. The EAST framework was designed with the 
varying sizes of word regions in mind [56]. The idea is to detect large word regions using features from the 
later stages of the network, and small word regions using features from the early stages [57]. To this end, 
the authors organised the single network into three branches: a feature extractor stem, a feature-merging 
branch, and an output layer [58]. 
Figure 1: EAST framework [1] 


Feature extractor stem: 


This part of the network extracts features from different layers of the network. The stem can be a 
convolutional network pre-trained on the ImageNet dataset [59]. The authors of the EAST framework 
experimented with both PVANet and VGG16. Only the VGG16 configuration of EAST is considered in this paper 
[60-63]. Fig. 2 shows how the VGG16 model is put together [64]. 
Figure 2: VGG16 [2] 


As described above, the stem is a convolutional network pre-trained on ImageNet, and only the VGG16 
configuration is considered here [65-71]. The output layer consists of a score map and a geometry map. The 
score map gives the probability that text is present at each location, while the geometry map defines the 
boundary of the text box [72]. The geometry map can be either a rotated box or a quadrangle, at the user's 
choice. A rotated box is defined by its top-left coordinate, width, height, and rotation angle. A quadrangle, 
on the other hand, is defined by the coordinates of all four corners [73-81]. 
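The relationship between the two geometry formats can be sketched as a conversion from a rotated box to its four corners. This is an illustrative sketch only: the function name is hypothetical, and the convention that the box rotates about its top-left corner, with the angle measured from the x-axis towards the y-axis in image coordinates, is an assumption.

```python
import math

def rotated_box_to_quad(x, y, width, height, angle_deg):
    """Return the four corners of a box rotated about its top-left corner (x, y).

    Corners are ordered top-left, top-right, bottom-right, bottom-left in the
    box's own frame. Image coordinates are assumed, with y pointing down.
    """
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    # Corner offsets in the unrotated box frame.
    local_corners = [(0.0, 0.0), (width, 0.0), (width, height), (0.0, height)]
    # Apply a 2D rotation to each offset, then translate by the top-left corner.
    return [
        (x + dx * cos_t - dy * sin_t, y + dx * sin_t + dy * cos_t)
        for dx, dy in local_corners
    ]

# An unrotated 100 x 30 box whose top-left corner is at (10, 20).
quad = rotated_box_to_quad(10.0, 20.0, 100.0, 30.0, 0.0)
```

With a rotation angle of zero the quadrangle is simply the axis-aligned rectangle, which makes the conversion easy to check by hand.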


Loss Function 


This EAST technique employs a loss function that combines a score map loss with a geometry loss function 
[82]. 


$$L = L_s + \lambda_g L_g$$


In this equation, $L_s$ is the score map loss and $L_g$ is the geometry loss; the two are combined using the 
weight $\lambda_g$, which controls the relative importance of the two losses [83-85]. The authors of the EAST 
paper set this weight to 1. 


Non-Maximum Suppression Stage 


The geometries predicted by the fully convolutional network are first filtered by a threshold. The remaining 
geometries are then suppressed using a locality-aware NMS. Naive NMS runs in $O(n^2)$ [86]. To approach 
$O(n)$ in the best case, the authors adopted row-by-row merging, in which each geometry is merged with the 
last merged geometry when the two overlap sufficiently [87]. Although this is fast in most situations, the 
worst-case time complexity is still $O(n^2)$ [88]. 
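The row-by-row merging idea can be sketched as follows. This is a simplified illustration, not the EAST implementation: it assumes axis-aligned boxes (x1, y1, x2, y2) in place of rotated boxes or quadrangles, assumes the boxes arrive in row-by-row order, and uses hypothetical function names and an assumed overlap threshold of 0.5.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def weighted_merge(a, score_a, b, score_b):
    """Score-weighted average of two boxes; the merged score is the sum of scores."""
    total = score_a + score_b
    box = tuple((score_a * p + score_b * q) / total for p, q in zip(a, b))
    return box, total

def standard_nms(boxes, scores, threshold):
    """Naive O(n^2) NMS: visit boxes by descending score, drop overlapping ones."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= threshold for k in kept):
            kept.append(i)
    return [(boxes[i], scores[i]) for i in kept]

def locality_aware_nms(boxes, scores, threshold=0.5):
    """Merge each box with the last merged box if they overlap, then run NMS."""
    merged = []
    last = None
    for box, score in zip(boxes, scores):  # boxes are assumed to be in row order
        if last is not None and iou(box, last[0]) > threshold:
            last = weighted_merge(box, score, last[0], last[1])
        else:
            if last is not None:
                merged.append(last)
            last = (box, score)
    if last is not None:
        merged.append(last)
    return standard_nms([m[0] for m in merged], [m[1] for m in merged], threshold)

# Two heavily overlapping boxes followed by one distant box (illustrative values).
result = locality_aware_nms(
    [(0, 0, 10, 10), (1, 0, 11, 10), (50, 50, 60, 60)],
    [0.9, 0.8, 0.7],
)
```

In this example the first two boxes are merged into one, so the merging pass hands only two candidates to the quadratic NMS step.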


Recurrent Neural Network 


Recurrent neural networks (RNNs) are a special kind of neural network that finds extensive application in 
natural language processing (NLP) [89]. A generic neural network processes an input through some layers and 
generates an output, on the assumption that two successive inputs are independent of each other. In some 
real-world situations, however, this assumption proves to be false [90]. For example, when forecasting the 
stock price at a specific moment or predicting the next word in a sequence, one must take prior observations 
into account [91-93]. RNNs are called "recurrent" because they perform the same task for every element of a 
sequence, with the output depending on the results of earlier computations. Alternatively, RNNs can be 
thought of as having a "memory" that stores information about previous computations. In theory RNNs can make 
use of information from arbitrarily long sequences, but in practice they can look back only a few steps [94]. 
The LSTM, an RNN variant that can look back further than the original RNN, has proven quite effective in 
natural language processing. Human thinking does not start from scratch every second. As you read this paper, 
you understand each word based on your understanding of the previous words. You don't throw everything away 
and start over; your thoughts have persistence. Conventional neural networks cannot do this, which is a key 
limitation [95-101]. Suppose, for example, that you want to label every scene in a film according to the type 
of event occurring there. It is unclear how a conventional neural network could use its reasoning about 
earlier events in the film to inform later ones. Recurrent neural networks address this problem. They are 
networks with loops in them, which allow information to persist [102]. 


The recurrent neural network has loops 


In the diagram, a node A receives the input x_t and returns the output h_t. A loop allows information to be 
passed from one step of the network to the next. These loops can make recurrent neural networks seem 
mysterious [103-104]. On closer inspection, however, they are not so different from a regular neural network. 
A recurrent neural network can be thought of as multiple copies of the same network, each passing a message 
to its successor [105]. This chain-like structure shows that recurrent neural networks are closely related to 
sequences and lists. They are the natural neural network architecture for this kind of data [106], and they 
are put to good use. Speech recognition, language modelling, translation, and image captioning are just a few 
areas where RNNs have been applied with great success in recent years. These achievements owe a great deal to 
LSTMs, a special kind of recurrent neural network that works well on many tasks where the standard form fails 
[107-112]. Almost all of the notable results obtained with recurrent neural networks rely on them. This paper 
focuses on these long short-term memory networks [113]. 
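The loop described here, in which the same network is applied at every step and passes its state to the next step, can be sketched with a scalar recurrence. The weights below are arbitrary illustrative values, not trained parameters, and the tanh update is the standard textbook form, not a detail taken from this study.

```python
import math

def rnn_step(x_t, h_prev, w_x, w_h, b):
    """One step of a scalar RNN: h_t = tanh(w_x * x_t + w_h * h_prev + b)."""
    return math.tanh(w_x * x_t + w_h * h_prev + b)

def run_rnn(inputs, w_x=0.5, w_h=0.8, b=0.0, h0=0.0):
    """Apply the same step, with the same weights, to every element of a sequence."""
    h = h0
    states = []
    for x_t in inputs:
        h = rnn_step(x_t, h, w_x, w_h, b)  # the new state depends on the old state
        states.append(h)
    return states

# Only the first input is non-zero, yet later states are still influenced by it.
states_a = run_rnn([1.0, 0.0, 0.0])
# With all-zero inputs and a zero initial state, every state stays at zero.
states_b = run_rnn([0.0, 0.0, 0.0])
```

Comparing the two runs shows the "memory" effect: the first input of the first sequence still influences its third state, although the influence shrinks at each step.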


LSTM Networks 


LSTMs, short for long short-term memory networks, are a special kind of RNN capable of learning long-term 
dependencies. They were introduced by Hochreiter and Schmidhuber (1997) and have since been refined and 
popularised by many others. They are widely used because they work well across a large variety of problems. 
LSTMs are explicitly designed to avoid the long-term dependency problem that limits standard RNNs [114-119]. 
Remembering information for long periods of time is practically their default behaviour, not something they 
struggle to learn. All recurrent neural networks have the form of a chain of repeating neural network modules 
[120]. In standard RNNs, this repeating module has a very simple structure, such as a single tanh layer 
[121]. 


Figure 3: The repeating module in a standard RNN contains a single layer [3] 


LSTMs also have this chain-like structure, but the repeating module is organised differently. Instead of a 
single neural network layer, there are four, interacting in a particular way (Fig. 4). 


Figure 4: The repeating module in an LSTM contains four interacting layers [4] 
Legend: neural network layer, pointwise operation, vector transfer, concatenate, copy 


Each line in the diagram carries an entire vector from the output of one node to the inputs of others 
[122-125]. The pink circles represent pointwise operations such as vector addition, and the yellow boxes 
represent learned neural network layers [126-131]. Lines that merge denote concatenation, while a line that 
forks denotes its content being copied, with the copies going to different locations. The horizontal line 
running along the top of the diagram represents the cell state, which is the key to LSTMs [132]. The cell 
state is somewhat like a conveyor belt. It runs straight down the entire chain with only minor linear 
interactions [133], so information can easily flow along it unchanged. The LSTM can remove information from 
or add information to the cell state, carefully regulated by structures called gates. Gates are a way of 
optionally letting information through [134]. Each gate is composed of a sigmoid neural network layer and a 
pointwise multiplication operation [135]. The sigmoid layer outputs values between 0 and 1 that describe how 
much of each component should be let through. A value of zero means that nothing is let through, and a value 
of one means that everything is. An LSTM has three of these gates to protect and control the cell state. 
LSTMs were a major advance in what RNNs can accomplish [136-141]. A further step is attention, in which the 
RNN decides at each step which information to look at from a larger collection. For every word it generates 
in an image caption, for instance, an RNN may attend to a different region of the image. Attention has 
yielded promising results, and there appears to be much more on the horizon [142]. 
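The gating mechanism can be made concrete with a scalar sketch of one LSTM step. It follows the standard formulation with forget, input, and output gates plus a tanh candidate layer, which gives the four layers of the repeating module. The parameter layout and the bias values used to force the gates open or closed are assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    """Squash a value into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One step of a scalar LSTM cell; params maps a layer name to (w_x, w_h, b)."""
    def pre_activation(name):
        w_x, w_h, b = params[name]
        return w_x * x_t + w_h * h_prev + b

    f = sigmoid(pre_activation("forget"))       # how much of the old cell state to keep
    i = sigmoid(pre_activation("input"))        # how much new information to admit
    o = sigmoid(pre_activation("output"))       # how much of the cell state to expose
    g = math.tanh(pre_activation("candidate"))  # candidate values for the cell state
    c_t = f * c_prev + i * g                    # pointwise multiply and add on the belt
    h_t = o * math.tanh(c_t)
    return h_t, c_t

# Biases of +20 and -20 push a gate to (almost) fully open or fully closed.
keep_everything = {
    "forget": (0.0, 0.0, 20.0), "input": (0.0, 0.0, -20.0),
    "output": (0.0, 0.0, 20.0), "candidate": (1.0, 1.0, 0.0),
}
forget_everything = dict(keep_everything, forget=(0.0, 0.0, -20.0))

_, c_kept = lstm_step(1.0, 0.0, 0.7, keep_everything)
_, c_dropped = lstm_step(1.0, 0.0, 0.7, forget_everything)
```

With the forget gate open and the input gate closed, the cell state of 0.7 passes through almost unchanged; closing the forget gate as well drives it to nearly zero.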


Literature Survey 


New approaches to text extraction continue to yield better results. Numerous publications have already 
investigated this general concept and provide a solid foundation for our own work. The extraction procedure 
has been greatly improved thanks to these contributions [143-149]. One such system combines a 
domain-independent optical character recognition (OCR) tool with a knowledge-based information extraction 
component (FRESCO). With an error rate of less than 1 percent for the items to be retrieved, its automation 
rate for analysing health insurance bills is well above 50 percent [150]. 


J. A. Devolites, R. Cybyk, and J. Wyatt (2010) presented a system, method, and computer programme product for 
obtaining data from a telecommunications invoice. One embodiment involves receiving a telecommunications 
invoice data stream in a first data format, analysing the data stream to determine that format, modelling the 
data stream, and mapping it to a normalised data format. In one implementation, modelling the data stream may 
further involve any of the following: creating a model for the first data format; modelling the data stream 
according to that model; and/or modelling the data stream with an intelligent adapter [151]. 


Another study introduced an approach to finding useful information in semi-structured documents. Instead of 
using predefined templates, rules, or optical character recognition (OCR)-based engines, this approach 
analyses the morphological composition of documents and the structural relationship between columns and text 
rows in order to capture their contents accurately. Structured clustering is a two-stage process that first 
analyses the document's columns and then clusters similar rows of text together [152-155]. A subsequent 
refinement step merges groups of text rows, not individual rows [156]. Extra care is taken to associate these 
text rows correctly, because groups may not be split in the usual bottom-up method. The method was evaluated 
across numerous document categories with encouraging results: 95.53 percent accuracy on the test set. The 
results of the geometric layout analysis were then fed into a logical layout analysis, which gives each class 
discovered in the geometric analysis a logical interpretation. A further approach uses a set of (positive) 
sample entities taken from either enterprise databases (such as a product catalogue) or annotated documents 
(e.g., historical invoices) [157]. Its key contribution is a method for learning regular expressions that are 
powerful yet easy to understand and use [158]. Its efficiency comes from a novel technique that compares and 
contrasts dependent entity properties at several levels of granularity (i.e., character and token level) and 
chooses the most appropriate ones to build a regular expression [159]. 


A form processing method has also been disclosed for extracting non-personalised data from filled-in forms. 
In accordance with some embodiments, the method extracts data items from a form that has been digitised as a 
digital image and comprises: prompting a user to indicate the location of one or more physical fields, each 
relating to a data item of a specific type; receiving the indications provided by the user; and identifying 
the locations of those fields. Some preferred implementations also use the identified locations to create a 
template with instructions on where to find data items of the same type on forms with a similar layout. The 
procedure may further entail comparing a digital image of a different form to the template and automatically 
extracting data items using the template instructions. In certain implementations, the user indicates the 
location of a physical field with a single click of a pointing device on or near the data item in the image. 


In other embodiments, the user indicates the location of the physical field by tapping a touch screen on or 
near the data item in the image. 


P. A. Van Zyl (2015) observed that automatic extraction and handling of invoice document information could 
benefit many firms because it saves time and effort. Document Analysis and Recognition (DAR) uses OCR to 
analyse and recognise physical documents so that their content can be extracted and processed digitally. It 
comprises pre-processing, layout analysis, text recognition, and post-processing. Pre-processing prepares 
document images for the later stages. Any errors not corrected during pre-processing propagate through the 
rest of the OCR process and lead to erroneous recognition. Identifying the most effective pre-processing 
methods for invoice document analysis and recognition would benefit both the relevant academic fields and the 
commercial community. The study addressed the problem in two parts. The first was exploratory: case studies 
of five DAR-related business owners and CEOs, in which transcribed semi-structured interviews were analysed 
for recurring themes. The second was experimental: a range of pre-processing approaches was applied to 
invoice document images to evaluate how effectively each improved recognition rates, and the recognition 
rates allowed a quantitative comparison. The study found that DAR companies use the same business model. It 
also examined industry-standard DAR software, ICR methods, and scanning techniques. The widening gap between 
paper-based information and computer processing ensures the industry's future. 


The experiments demonstrated that some pre-processing procedures work better than others, and the results 
revealed a wealth of information about the efficacy of the various methods. Techniques could also be compared 
with respect to their processing times, where processing time refers either to the time needed to apply the 
technique to the document image or to the time needed to recognise the text in the processed document image. 
A computer-readable storage medium is any device or medium that can store code and/or data for use by a 
computer system, such as the data structures and code described in this system. Many types exist, including 
magnetic and optical discs, tapes, and even paper documents that can be stored in a drawer or a filing 
cabinet, as well as disk drives, CDs, DVDs, and any other media capable of storing code and/or data, whether 
now existing or developed in the future. 


Another disclosure focuses on extracting data from forms such as bills. In an illustrative approach, a 
particular data type is represented as a label:value pair, where the label component is a text string 
expressing a data label and the value component is the data associated with it. Although the discussion in 
that disclosure is directed toward specific embodiments, such as the extraction of information from forms, 
the disclosed techniques apply in general to documents that include a tabular arrangement. First, textual and 
graphical elements are extracted from an image or native digital document. Next, candidate layout structures 
are generated from these elements, with the understanding that a given element may belong to several layout 
structures. The information labels are then used to tag relevant content. To be usable with the disclosed 
method and system, information must have the same label:value structure. The targeted information is the 
collection of data items to be extracted, and a best-match assignment is made between these items and the 
extracted label:value pairs. Invoice data processing is a subfield of data extraction in which relevant 
information is extracted from scanned invoices. Current methods often use templates to help collect specific 
data. Because a template is required for each layout, however, this approach has limited applicability. 
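The label:value structure can be illustrated with a small sketch that pulls such pairs out of recognised text lines. It is not the method of the cited disclosure: the regular expression, the function name, and the sample lines (modelled on a typical restaurant bill) are assumptions made for illustration.

```python
import re

# A label, a colon, and a numeric value with optional thousands separators.
LABEL_VALUE = re.compile(
    r"^\s*(?P<label>.+?)\s*:\s*(?P<value>\d[\d,]*(?:\.\d+)?)\s*$"
)

def extract_label_values(lines):
    """Return a dict mapping each label to its numeric value for matching lines."""
    pairs = {}
    for line in lines:
        match = LABEL_VALUE.match(line)
        if match:
            value = float(match.group("value").replace(",", ""))
            pairs[match.group("label")] = value
    return pairs

# Hypothetical OCR output: two well-formed pairs, one pair with a missing value,
# and one item row that has no label:value structure.
ocr_lines = [
    "Service Charge 10% : 676.00",
    "Sub Total : 6,770.00",
    "Total :",
    "LEMON MOJITO 4 580.00",
]
pairs = extract_label_values(ocr_lines)
```

Lines without a complete label:value pair are skipped, so a downstream database update would only receive fields that were fully recognised.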


Proposed Work 


Fig. 1 shows an offline handwritten text recognition (HTR) system that digitises handwritten text from 
images. We train a neural network (NN) on word images from the IAM dataset. Because the input layer, and 
therefore the subsequent layers, can be kept small for word images, the NN can be trained on a CPU (although 
a GPU would be faster). This implementation provides the minimum functionality needed for HTR with 
TensorFlow (TF). The architecture consists of convolutional NN (CNN) layers, recurrent NN (RNN) layers, and a 
Connectionist Temporal Classification (CTC) layer, and is depicted in Fig. 2. More formally, the NN can be 
viewed as a function (see Eq. 1) that maps an image (or matrix) M of size W x H to a character sequence 
(c1, c2, ...) of length between 0 and L. Because the text is recognised at the level of individual 
characters, recognition is not limited to words contained in the training data: unseen words and texts can 
also be recognised, as long as the individual characters are classified correctly. 


$$\mathrm{NN}: \underbrace{M}_{W \times H} \rightarrow \underbrace{(c_1, c_2, \ldots, c_n)}_{0 \le n \le L} \qquad (1)$$


Result and Discussion 


We have described an NN that can recognise text in images. The NN has 5 CNN layers and 2 RNN layers, and it 
outputs a character-probability matrix (Fig. 6). 




LEMON MOJITO 4 580.00 
KACCHI KAIRI MARGARITA 1 145.00 
FRESH LIME SODA MIX 1 75.00 
FRESH LIME WATER MIX 1 75.00 
GREEN APPLE MOJITO 1 145.00 
BASMALAT 1 100.00 
Sub Total 6,770.00 

Service Charge 10% : 676.00 

Vat 12.5% ; 801.00 
Vat 20% : 208.00 
Service Tax 368 .00 


Total : 


Figure 6: Output 


This matrix is used either for the CTC loss computation (during training) or for CTC decoding (during 
recognition). The important parts of the code for a TF implementation were presented. Finally, suggestions 
were made to improve the recognition accuracy. 
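How CTC decoding turns a character-probability matrix into text can be sketched with best-path (greedy) decoding: take the most likely symbol at each time-step, collapse repeated symbols, and remove blanks. This is a plain-Python illustration and not the TF code of this study; placing the blank symbol at the last index and the toy matrix below are assumptions.

```python
def ctc_best_path_decode(prob_matrix, charset):
    """Greedy CTC decoding.

    prob_matrix[t][k] is the probability of symbol k at time-step t. Indices
    0..len(charset)-1 are characters; the last index is assumed to be the blank.
    """
    blank = len(charset)
    # Most likely symbol index at each time-step.
    best = [max(range(len(row)), key=row.__getitem__) for row in prob_matrix]
    decoded = []
    previous = None
    for index in best:
        # Keep a symbol only if it differs from the previous one and is not blank.
        if index != previous and index != blank:
            decoded.append(charset[index])
        previous = index
    return "".join(decoded)

# Toy matrix over the characters "a", "b" and the blank (illustrative values).
# The best path is: a, a, blank, a, b, b.
matrix = [
    [0.8, 0.1, 0.1],
    [0.6, 0.2, 0.2],
    [0.1, 0.1, 0.8],
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
]
text = ctc_best_path_decode(matrix, "ab")
```

The blank between the second and third "a" is what allows a genuinely doubled character to survive the collapsing of repeats.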


Conclusion 


A text recognizer detects and recognises the characters on handwritten bills and invoices and updates the 
corresponding database fields. The process automatically digitises printed and handwritten text from an image 
so that it can be added to a database. We begin by scanning a bill and extracting the text using OCR with 
deep learning: text detection is performed by EAST, a CNN-based detector; recognition is then performed by an 
RNN trained on the IAM dataset; and if the text is recognised, the database is updated. The system for 
extracting text from bills and invoices was built with Python's OpenCV packages. OpenCV and Nanonet OCR were 
both updated to their most recent versions. Bills are detected effectively using pre-processing, a 
segmentation method, and text extraction. The IAM dataset is used for recognition, and the database is 
updated automatically. To capture the photograph and continue the process on a mobile phone, we are 
developing an Android application in Android Studio. 